Exploratory Data Analysis of a Direct Marketing Campaign from a Portuguese Bank

Abstract :

This exercise reflects what we learned in the Exploratory Data Analysis lesson of Udacity’s Data Analyst Nanodegree. The data relates to direct marketing campaigns (phone calls) of a Portuguese banking institution. The classification goal is to predict whether the client will subscribe to a term deposit (variable y).

The data set is provided by the UC Irvine Machine Learning Repository [Moro et al., 2014]: S. Moro, P. Cortez and P. Rita. A Data-Driven Approach to Predict the Success of Bank Telemarketing. Decision Support Systems, Elsevier, 62:22-31, June 2014.

Univariate Plots Section

In this section, we will explore many variables and their distributions. The objective is to gain a general understanding of the data in the data set.

Length & Structure

## [1] 41188

The data set has 41,188 observations across 21 variables:

##  [1] "age"            "job"            "marital"        "education"     
##  [5] "default"        "housing"        "loan"           "contact"       
##  [9] "month"          "day_of_week"    "duration"       "campaign"      
## [13] "pdays"          "previous"       "poutcome"       "emp.var.rate"  
## [17] "cons.price.idx" "cons.conf.idx"  "euribor3m"      "nr.employed"   
## [21] "y"

Input variables:
# bank client data:
1 - age (numeric)
2 - job : type of job (categorical: “admin.”,“blue-collar”,“entrepreneur”,“housemaid”,“management”,“retired”,“self-employed”,“services”,“student”,“technician”,“unemployed”,“unknown”)
3 - marital : marital status (categorical: “divorced”,“married”,“single”,“unknown”; note: “divorced” means divorced or widowed)
4 - education (categorical: “basic.4y”,“basic.6y”,“basic.9y”,“high.school”,“illiterate”,“professional.course”,“university.degree”,“unknown”)
5 - default: has credit in default? (categorical: “no”,“yes”,“unknown”)
6 - housing: has housing loan? (categorical: “no”,“yes”,“unknown”)
7 - loan: has personal loan? (categorical: “no”,“yes”,“unknown”)
# related with the last contact of the current campaign:
8 - contact: contact communication type (categorical: “cellular”,“telephone”)
9 - month: last contact month of year (categorical: “jan”, “feb”, “mar”, …, “nov”, “dec”)
10 - day_of_week: last contact day of the week (categorical: “mon”,“tue”,“wed”,“thu”,“fri”)
11 - duration: last contact duration, in seconds (numeric). Important note: this attribute highly affects the output target (e.g., if duration=0 then y=“no”). Yet, the duration is not known before a call is performed. Also, after the end of the call y is obviously known. Thus, this input should only be included for benchmark purposes and should be discarded if the intention is to have a realistic predictive model.
# other attributes:
12 - campaign: number of contacts performed during this campaign and for this client (numeric, includes last contact)
13 - pdays: number of days that passed by after the client was last contacted from a previous campaign (numeric; 999 means client was not previously contacted)
14 - previous: number of contacts performed before this campaign and for this client (numeric)
15 - poutcome: outcome of the previous marketing campaign (categorical: “failure”,“nonexistent”,“success”)
# social and economic context attributes
16 - emp.var.rate: employment variation rate - quarterly indicator (numeric)
17 - cons.price.idx: consumer price index - monthly indicator (numeric)
18 - cons.conf.idx: consumer confidence index - monthly indicator (numeric)
19 - euribor3m: euribor 3 month rate - daily indicator (numeric)
20 - nr.employed: number of employees - quarterly indicator (numeric)
Output variable (desired target):
21 - y - has the client subscribed a term deposit? (binary: “yes”,“no”)

Data is grouped in 4 ‘types of data’ :

  1. Client Data - Demographic [age - education]
  2. Client Data - Banking [default - loan]
  3. Campaign data - [contact - poutcome]
  4. Macro social and economic Portuguese context [emp.var.rate - nr.employed]

The last variable, named ‘y’, represents the success or failure of the marketing campaign:

“The client has purchased the service marketed”

The success rate is ~11% of the total observations (4,640 of 41,188).

Client Data

We can quickly see that the client data consists predominantly of factor variables, with the exception of the ‘age’ variable.

For each variable displayed, the count and proportions of the total are represented next to each other.

80% of observations are between 27 and 57 years old

60% of observations are married

A ‘hasloan’ variable is created in order to combine the housing and loan data.

Campaign Data

Variables in this section are both categorical and numerical.

‘pdays’ data is unusable in its raw form, as the value 999 was used instead of NA for clients who were not previously contacted. We observe variability across months, but almost none across days of the week. This is probably due to a launch that was stronger than the follow-up.
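
One possible way to handle the 999 sentinel is to recode it as NA before summarising. The sketch below is not part of the original analysis; it uses a made-up vector in place of the real pdays column, and the names pdays_clean, share_not_contacted and median_days are purely illustrative.

```r
# Toy vector standing in for the pdays column (values are made up)
pdays <- c(999, 3, 999, 6, 999, 999, 12)

# Recode the 999 sentinel as NA so it no longer distorts numeric summaries
pdays_clean <- ifelse(pdays == 999, NA, pdays)

# Share of clients never contacted before, and the median over real values only
share_not_contacted <- mean(is.na(pdays_clean))
median_days <- median(pdays_clean, na.rm = TRUE)
```

With the sentinel recoded, summary statistics describe only the clients who were actually contacted in a previous campaign.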

The following plot represents the campaign variable, described as:

number of contacts performed during this campaign and for this client (numeric, includes last contact)

Both distributions are heavily skewed towards low values.

Macro social and economic portuguese context data

The histograms show that the distributions of those variables are not ‘conventional’.

Indeed, the structure and summary below confirm the irregularity of the socio-economic variables:
## 'data.frame':    41188 obs. of  22 variables:
##  $ age           : int  56 57 37 40 56 45 59 41 24 25 ...
##  $ job           : Factor w/ 12 levels "admin.","blue-collar",..: 4 8 8 1 8 8 1 2 10 8 ...
##  $ marital       : Ord.factor w/ 4 levels "single"<"married"<..: 2 2 2 2 2 2 2 2 1 1 ...
##  $ education     : Factor w/ 8 levels "basic.4y","basic.6y",..: 1 4 4 2 4 3 6 8 6 4 ...
##  $ default       : Ord.factor w/ 3 levels "yes"<"no"<"unknown": 2 3 2 2 2 3 2 3 2 2 ...
##  $ housing       : Ord.factor w/ 3 levels "yes"<"no"<"unknown": 2 2 1 2 2 2 2 2 1 1 ...
##  $ loan          : Ord.factor w/ 3 levels "yes"<"no"<"unknown": 2 2 2 2 1 2 2 2 2 2 ...
##  $ contact       : Factor w/ 2 levels "cellular","telephone": 2 2 2 2 2 2 2 2 2 2 ...
##  $ month         : Ord.factor w/ 10 levels "mar"<"apr"<"may"<..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ day_of_week   : Ord.factor w/ 5 levels "mon"<"tue"<"wed"<..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ duration      : int  261 149 226 151 307 198 139 217 380 50 ...
##  $ campaign      : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ pdays         : int  999 999 999 999 999 999 999 999 999 999 ...
##  $ previous      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ poutcome      : Factor w/ 3 levels "failure","nonexistent",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ emp.var.rate  : num  1.1 1.1 1.1 1.1 1.1 1.1 1.1 1.1 1.1 1.1 ...
##  $ cons.price.idx: num  94 94 94 94 94 ...
##  $ cons.conf.idx : num  -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 ...
##  $ euribor3m     : num  4.86 4.86 4.86 4.86 4.86 ...
##  $ nr.employed   : num  5191 5191 5191 5191 5191 ...
##  $ y             : Ord.factor w/ 2 levels "yes"<"no": 2 2 2 2 2 2 2 2 2 2 ...
##  $ hasloan       : Ord.factor w/ 2 levels "yes"<"no/unknown": 2 2 1 2 1 2 2 2 1 1 ...
##       age                 job            marital     
##  Min.   :17.00   admin.     :10422   single  :11568  
##  1st Qu.:32.00   blue-collar: 9254   married :24928  
##  Median :38.00   technician : 6743   divorced: 4612  
##  Mean   :40.02   services   : 3969   unknown :   80  
##  3rd Qu.:47.00   management : 2924                   
##  Max.   :98.00   retired    : 1720                   
##                  (Other)    : 6156                   
##                education        default         housing     
##  university.degree  :12168   yes    :    3   yes    :21576  
##  high.school        : 9515   no     :32588   no     :18622  
##  basic.9y           : 6045   unknown: 8597   unknown:  990  
##  professional.course: 5243                                  
##  basic.4y           : 4176                                  
##  basic.6y           : 2292                                  
##  (Other)            : 1749                                  
##       loan            contact          month       day_of_week
##  yes    : 6248   cellular :26144   may    :13769   mon:8514   
##  no     :33950   telephone:15044   jul    : 7174   tue:8090   
##  unknown:  990                     aug    : 6178   wed:8134   
##                                    jun    : 5318   thu:8623   
##                                    nov    : 4101   fri:7827   
##                                    (Other): 4466              
##                                    NA's   :  182              
##     duration         campaign          pdays          previous    
##  Min.   :   0.0   Min.   : 1.000   Min.   :  0.0   Min.   :0.000  
##  1st Qu.: 102.0   1st Qu.: 1.000   1st Qu.:999.0   1st Qu.:0.000  
##  Median : 180.0   Median : 2.000   Median :999.0   Median :0.000  
##  Mean   : 258.3   Mean   : 2.568   Mean   :962.5   Mean   :0.173  
##  3rd Qu.: 319.0   3rd Qu.: 3.000   3rd Qu.:999.0   3rd Qu.:0.000  
##  Max.   :4918.0   Max.   :56.000   Max.   :999.0   Max.   :7.000  
##                                                                   
##         poutcome      emp.var.rate      cons.price.idx  cons.conf.idx  
##  failure    : 4252   Min.   :-3.40000   Min.   :92.20   Min.   :-50.8  
##  nonexistent:35563   1st Qu.:-1.80000   1st Qu.:93.08   1st Qu.:-42.7  
##  success    : 1373   Median : 1.10000   Median :93.75   Median :-41.8  
##                      Mean   : 0.08189   Mean   :93.58   Mean   :-40.5  
##                      3rd Qu.: 1.40000   3rd Qu.:93.99   3rd Qu.:-36.4  
##                      Max.   : 1.40000   Max.   :94.77   Max.   :-26.9  
##                                                                        
##    euribor3m      nr.employed     y               hasloan     
##  Min.   :0.634   Min.   :4964   yes: 4640   yes       :24133  
##  1st Qu.:1.344   1st Qu.:5099   no :36548   no/unknown:17055  
##  Median :4.857   Median :5191                                 
##  Mean   :3.621   Mean   :5167                                 
##  3rd Qu.:4.961   3rd Qu.:5228                                 
##  Max.   :5.045   Max.   :5228                                 
## 

What is/are the main feature(s) of interest in your dataset?

The main feature of this data set is that it clearly represents the information produced during a business process: a direct marketing campaign. From a business perspective, the success/failure variable is the key against which to analyse the other variables.

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

Besides the success/failure variable, we are provided with 3 blocks of data :

  1. Client data
  2. Campaign data
  3. Macro socio-economic data

We will investigate how the success variable relates to each of those blocks, and whether one of them is more strongly associated with a success event.

Did you create any new variables from existing variables in the dataset?

Yes, we created a new variable that ‘merges’ the information contained in the housing and loan variables. The objective is to have one variable that captures the binary condition of having a loan.
The new variable is called ‘hasloan’. Its value is ‘yes’ if housing == ‘yes’ OR loan == ‘yes’, and ‘no/unknown’ otherwise.
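
The rule above could be written in base R roughly as follows. This is a minimal sketch on a made-up data frame (the name bank and its rows are assumptions, not the original code); the level names follow the str() output shown earlier.

```r
# Toy data standing in for the housing and loan columns (made-up values)
bank <- data.frame(
  housing = c("yes", "no", "unknown", "no", "yes"),
  loan    = c("no", "yes", "no", "no", "unknown")
)

# 'yes' if either loan type is present; everything else is 'no/unknown'
bank$hasloan <- factor(
  ifelse(bank$housing == "yes" | bank$loan == "yes", "yes", "no/unknown"),
  levels = c("yes", "no/unknown"),
  ordered = TRUE
)

table(bank$hasloan)
```

Note that under this rule a client with ‘unknown’ in both columns falls into ‘no/unknown’, so the new variable cannot distinguish missing information from the absence of a loan.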

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

Yes, as seen previously, the macro socio-economic block of data has a rather unusual distribution. We did not perform any operation on these variables, as we have no experience dealing with this type of data. Nothing seems to point to corrupted or flawed data input.

Regarding the duration variable, we can apply a log transformation to bring its distribution closer to normal. After a lot of reading, and still without understanding it completely, we created a blog post that helped a lot. From it, we understood that the log transformation, besides making the plot look better, can help a future linear regression model perform better.
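
As an illustration of the log transformation, here is a small sketch on made-up durations. The + 1 offset is an assumption added here because duration can be 0 in this data set and log10(0) is -Inf; the original analysis may have handled zeros differently.

```r
# Made-up call durations in seconds, including a 0 as in the real data
duration <- c(0, 50, 139, 180, 261, 380, 4918)

# log10(x + 1) keeps duration = 0 defined (log10(0) would be -Inf)
duration_log <- log10(duration + 1)

# The transformed values span a much narrower range than the raw ones
range(duration)
range(duration_log)
```

In a ggplot2 histogram, a similar effect on the axis can be obtained with scale_x_log10(), although zero values are then dropped from the plot.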

Bivariate Plots Section

Now that we have a good understanding of each variable on its own, let’s start building relationships. Our investigation will focus on understanding which variables are significantly correlated with the ‘y’ variable, which indicates success in the marketing campaign.

Afterwards, we will look for relationships that do not involve y.

Age vs y

The age variable does seem to show a trend with respect to the y variable. However, when looking at both plots stacked on the same x axis, we can see that changes in the y proportion (success/failure) also occur when the frequency count of each age bin changes (bin width = 1 year, i.e. ~100 observations for age 16, ~1750 observations for age 35).

The more observations in any given bin, the lower the proportion of y (success).
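
To put the two quantities side by side numerically, the count and the success proportion per one-year age bin could be computed as in this sketch. The vectors are made up and the object names are illustrative, not taken from the original code.

```r
# Toy data: age in years and campaign outcome (made-up values)
age <- c(18, 18, 35, 35, 35, 35, 70, 70)
y   <- c("yes", "no", "no", "no", "no", "yes", "yes", "yes")

# Observations per one-year age bin
n_per_age <- table(age)

# Proportion of y == "yes" within each age bin
success_rate <- tapply(y == "yes", age, mean)

data.frame(age = names(success_rate),
           n = as.vector(n_per_age),
           success_rate = as.vector(success_rate))
```

Having both columns in one table makes it easier to check whether a high success proportion simply coincides with a bin that has very few observations.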

Let’s plot it in a classic box plot

Nothing catches our attention, except a larger IQR for the ‘yes’ (success) observations. Let’s make use of the GGally library to plot the relations between variables, always trying to relate them to the output variable y.

ggpairs is no magic. After the creation of the matrices, little is really uncovered. Maybe this is because of the categorical nature of the y variable? We purposely reduced the possible combinations to avoid an unreadable 21x21 matrix. Let’s dig deeper into the variables that have shown significant differences for the y (success/failure) variable.

‘Macro social and economic portuguese context data’ vs y : ex euribor3m

We can definitely observe a difference in the median of each group.

## [1] 1.266 4.857
## [1] "Median Euribor3m for group of y = yes 1.266"
## [1] "Median Euribor3m for group of y = no 4.857"
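
Group medians like the ones printed above could be obtained with tapply(). The following is a sketch on made-up values, not the original code; the variable med is an illustrative name.

```r
# Toy data standing in for euribor3m and y (made-up values)
euribor3m <- c(1.2, 1.3, 4.9, 4.8, 4.9, 1.0, 4.9)
y         <- c("yes", "yes", "no", "no", "no", "yes", "yes")

# Median Euribor3m within each outcome group
med <- tapply(euribor3m, y, median)

paste("Median Euribor3m for group of y = yes", med[["yes"]])
paste("Median Euribor3m for group of y = no", med[["no"]])
```

The median is used rather than the mean because the Euribor3m values cluster at a few levels, which makes the mean less representative of either group.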

Let’s recall the distribution of that variable.

OK, so we seem to have something here. Let’s find out what Euribor is.

Why is Euribor important? The Euribor rates are important because these rates provide the basis for the price or interest rate of all kinds of financial products, like interest rate swaps, interest rate futures, saving accounts (see: Euribor and savings) and mortgages. (link)

So bank clients accept the deposit offer when Euribor rates are at their lowest (within the sample). That does not make sense at first sight, as we would expect higher Euribor rates to mean higher deposit returns. Further research is needed to understand the correlation between these variables.

Is this finding applicable to all macro economic variables in this data set ?

Campaign vs y

We now want to dig deeper into the ‘campaign’ variable. Let’s recall its meaning:

12 - campaign: number of contacts performed during this campaign and for this client (numeric, includes last contact)

Intuitively, we would expect that more calls increase the chance of success, as there is more time for marketing to convince the client. Let’s see.

So we observe that there is a correlation, but a negative one: the more contacts (calls) a client receives, the less likely they are to convert (y value ‘yes’). We deduce that taking the number of calls as a proxy for ‘time spent in contact with marketing agents’ might not be accurate.

Let’s test with the actual ‘duration’ variable. This one should give interesting insights, as we are warned:

11 - duration: last contact duration, in seconds (numeric). Important note: this attribute highly affects the output target (e.g., if duration=0 then y=“no”). Yet, the duration is not known before a call is performed. Also, after the end of the call y is obviously known. Thus, this input should only be included for benchmark purposes and should be discarded if the intention is to have a realistic predictive model.

So we find a clear trend of improving success rates with increasing call duration. As we were warned, this insight is of little use as is, because duration is only known AFTER the call has ended, so no prediction can be based on this parameter alone.

Hasloan vs y

Now let’s check whether there is a relation between having a loan and subscribing to a term deposit: hasloan vs y. Remember, hasloan is a variable created in the univariate analysis. It combines the information of:

6 - housing: has housing loan? (categorical: “no”,“yes”,“unknown”)
7 - loan: has personal loan? (categorical: “no”,“yes”,“unknown”)

It seems there is not. Let’s show the percentages:

## Source: local data frame [4 x 4]
## Groups: hasloan [2]
## 
##      hasloan     y     n rel.freq
##        <ord> <ord> <int>    <chr>
## 1        yes   yes  2781    11.5%
## 2        yes    no 21352    88.5%
## 3 no/unknown   yes  1859    10.9%
## 4 no/unknown    no 15196    89.1%

And indeed, we can see no meaningful difference in success rate between the two hasloan groups (11.5% vs 10.9%).
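
The percentages in the table above can be reproduced from the counts alone. The sketch below copies the four counts from that table into a matrix and computes row-wise proportions with base R; it is an illustration, not the dplyr code that produced the original output.

```r
# Counts copied from the table above: rows = hasloan, columns = y
counts <- matrix(
  c(2781, 21352,
    1859, 15196),
  nrow = 2, byrow = TRUE,
  dimnames = list(hasloan = c("yes", "no/unknown"), y = c("yes", "no"))
)

# Row-wise proportions: share of each outcome within a hasloan group
rel_freq <- prop.table(counts, margin = 1)
round(100 * rel_freq, 1)
```

Using margin = 1 normalises within each hasloan group, which is what makes the two success rates directly comparable despite the different group sizes.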

Jobs vs Education

Since we have the information, we will check the relationship between job title and education.

This sample population matches what we would expect:

  • The admin population has ~90% with at least a high school education, with ~50-60% holding a university degree.
  • Blue-collar workers have ~60-70% with basic.9y education or less.
  • Management is composed of ~70% university degree profiles.

How does age affect jobs category ?

Housing vs Marital situation

We wanted to see whether marital situation has an impact on having a housing loan (filter(housing != ‘unknown’)).

It seems not.

Default vs job

We want to learn about the observations with default = yes. Let’s find out what proportion of the entire population is in that condition.

## [1] "Quantity of observations with variable default = yes : 3"

The population being too small, we cannot provide any insights about the defaulted population.

Bivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

We tried to explore various relationships with the success/fail variable.

  • age vs y: we found a trend of success rate changing with age; however, we also observed that the observation count changes with age. Further investigation is needed to test how the number of observations affects the success/failure ‘y’ variable.
    A later box plot of age vs y does not show a difference in the descriptive statistics of the two groups (success/failure).

  • Then we tried a ggpairs matrix, which hinted that we should dig deeper into the macro socio-economic variables.

  • ‘Macro social and economic Portuguese context data’ vs y (e.g. euribor3m): the box plot reveals differences in the medians of the groups (although the IQRs for success/fail share 80% of the euribor3m values). Indeed, a later histogram of the success proportion for each euribor3m level (bin width = 1) shows clear evidence of a greater success rate at the lower euribor3m levels (25% for the lower values, 5% for the higher ones). Plotting the other 4 macro social and economic variables gives similar findings.
    Further analysis with an expert in those variables is needed.

  • Then we compared campaign vs y, setting the hypothesis that more contacts/calls were indicative of higher chances of success (more time to be convinced, with more arguments). We found that the opposite is true, with better conversion rates in the first calls. Further investigation should be done to understand whether the number of contacts is a good proxy for more contact with marketing agents (e.g. the process could be ‘only generate a new contact if the previous one did not result in an actual conversation with the client’).

  • Trying the actual duration variable, we observe a clear correlation between duration and y success. However, this finding is not exploitable as is, because the value of duration is known only after the call (and its success/fail status) has ended.

  • Lastly, we investigated the relationship between loan owners (housing or personal) and the success rate (subscription to a term deposit, y == yes), setting the hypothesis that a client already engaged in a bank service (a loan) would be more easily driven to acquire another service. We found no evidence of such a trend.

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

  • As we did not find any strong correlation between the success/failure variable and other variables, we started looking for relationships between other pairs of variables. We found that this sample population matches what we would expect from a job vs education analysis and a job vs age analysis.

  • We hypothesized a relationship between the housing variable and marital status, trying to find a correlation. None was found.

  • Lastly, we wanted to inquire about the defaulted population, but the number of observations was very small (3 of 41,188).

What was the strongest relationship you found?

The strongest relationship was found between the macroeconomic variables and the success/failure variable. Professional input is required, because at this level we do not understand the logic behind the following finding:

The marketing campaign for subscription to a term deposit has better success rates when euribor3m levels are at their lowest (meaning lower interest rates, meaning lower return on the capital deposited).

There is also the y vs duration relationship, but we are warned that it is not usable as is for predictions.

Multivariate Plots Section

Duration vs Euribor3m vs y

## [1] "We will categorize the Euribor3m variable and \n#make use of the bucket with >100 count to facet the plot"
## 
## (0,1] (1,2] (2,3] (3,4] (4,5] (5,6] 
##  3908  9590     0    14 27667     9
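
The bucketing shown in the output above could be sketched with cut() and table(). The values below are made up and the threshold is lowered to suit the toy data; the names bucket, keep and faceting are illustrative assumptions rather than the original code.

```r
# Made-up Euribor3m rates standing in for the real column
euribor3m <- c(0.7, 0.9, 1.3, 1.4, 1.9, 3.5, 4.2, 4.9, 4.9, 5.0)

# Right-closed one-unit buckets, matching labels such as (0,1] and (4,5]
bucket <- cut(euribor3m, breaks = 0:6)
counts <- table(bucket)

# Keep only buckets with enough observations to facet on
# (the threshold is 2 here for the toy data; the analysis above uses > 100)
keep <- names(counts)[counts > 2]
faceting <- droplevels(bucket[bucket %in% keep])

table(faceting)
```

Dropping the sparse buckets before faceting avoids nearly empty panels, which is presumably why only the buckets with more than 100 observations are used.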

We confirm that Euribor3m has an effect on conversion, as presented in the bivariate analysis.

Education vs Age vs y

We do not observe any substantial variability based on education

Job vs Age vs y

Job vs Education vs y

First iteration

Second iteration :

We want to add the y variable information, so we need to compute it before it can be plotted.

## Source: local data frame [177 x 6]
## Groups: job, education [90]
## 
##       job           education     y     n rel.freq_sucess total_obs
##    <fctr>              <fctr> <ord> <int>           <dbl>     <int>
## 1  admin.          illiterate    no     1          100.00         1
## 2  admin.            basic.4y   yes    10           12.99        77
## 3  admin.            basic.4y    no    67           87.01        77
## 4  admin.            basic.6y   yes     8            5.30       151
## 5  admin.            basic.6y    no   143           94.70       151
## 6  admin.            basic.9y   yes    42            8.42       499
## 7  admin.            basic.9y    no   457           91.58       499
## 8  admin.         high.school   yes   382           11.47      3329
## 9  admin.         high.school    no  2947           88.53      3329
## 10 admin. professional.course   yes    49           13.50       363
## # ... with 167 more rows

Rel.freq is the proportion of each y outcome within each [job, education] combination. We need to keep only one row of the y variable per combination; luckily, as y is binary, we can keep (and plot) just the y == ‘yes’ row. The subset that removes the y == ‘no’ rows is as follows:

# Keep only the success rows, dropping 'illiterate' and combinations with few observations
ggplot(data = subset(data_sum, y == 'yes' & education != 'illiterate' & total_obs > 100),
       aes(job, education))

The color palette shows the %rate for success ( variable y == yes)

Job vs Marital vs y

The color palette shows the %rate for success ( variable y == yes)

The marital status variable does not bring information: the ‘color’ (meaning success rate) does not change with marital status.

Multivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

In this section we compared different combinations of variables:

  • Duration vs Euribor3m vs y
  • Education vs Age vs y
  • Job vs Age vs y
  • Job vs Education vs y
  • Job vs Marital vs y

No new combination of variables revealed pivotal information for the analysis beyond what we already knew from the bivariate section.

The main variables for inferring the success/failure of the marketing campaign are the macroeconomic ones (e.g. Euribor3m) and duration.

Were there any interesting or surprising interactions between features?

We did observe that clients in the job categories [student, retired] have more appetite for the product. This is similar to the findings of our first analysis (age vs y). What is interesting in this plot (marital vs job vs y) is that we can observe similarly sized groups (e.g. [retired + married] vs [entrepreneur + married]), yet the success rate does indeed increase towards the extreme ages (retired and students). This is further evidence that age influences the success of the campaign.


Final Plots and Summary

Plot One

Description One

This plot quickly shows the success rate of the variable y at different ages (bin width = 1 year). It was the first plot where we found a clear pattern. The concave curve could have been shown with another type of geom (maybe geom_polygon), but we find that bars correctly express the binary outcome of the variable: success/fail, 1/0, T/F.

Plot Two

Description Two

We chose this plot because it is where we started to feel comfortable with the exploration. The ggplot library was starting to become an ally instead of a headache. Here, we decided to convert the Euribor3m variable into an ordered categorical one in order to use it as a facet. We chose ncol = 1 so that all levels share the same x axis. We made use of transformations for the x and y variables in order to uncover the findings: both Euribor3m and duration affect success. When fixing duration (thus comparing the facets by Euribor3m), the proportion of success is higher in the first level. Furthermore, when fixing Euribor3m (thus looking at any one of the 3 plots above), the proportion of success is higher when the duration of the contact call is longer.

Plot Three

## Source: local data frame [177 x 6]
## Groups: job, education [90]
## 
##       job           education     y     n rel.freq total_obs
##    <fctr>              <fctr> <ord> <int>    <dbl>     <int>
## 1  admin.          illiterate    no     1   100.00         1
## 2  admin.            basic.4y   yes    10    12.99        77
## 3  admin.            basic.4y    no    67    87.01        77
## 4  admin.            basic.6y   yes     8     5.30       151
## 5  admin.            basic.6y    no   143    94.70       151
## 6  admin.            basic.9y   yes    42     8.42       499
## 7  admin.            basic.9y    no   457    91.58       499
## 8  admin.         high.school   yes   382    11.47      3329
## 9  admin.         high.school    no  2947    88.53      3329
## 10 admin. professional.course   yes    49    13.50       363
## # ... with 167 more rows

Description Three

In this plot, we managed to produce exactly what we had in mind. We had to create a grouped data frame, and then subset it in order to get the Rel.freq information. In this plot we show that age (expressed through jobs: student and retired) does have an impact on the success of the campaign. We can see that for those 2 job categories, the success rate is higher.


Reflection

This project was very challenging, on so many levels.

Overall, this project felt like a real challenge and we learned tons from it. Now, machine learning!!!